Week 03
Data Acquisition and Measurement

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-22

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this lecture, you will be able to:

  • Understand measurement properties and error
  • Work with census and government data
  • Distinguish probability from non-probability sampling
  • Access data through APIs
  • Work with messy data in R

Measurement

Why Measurement Matters

Reducing the complexity of our world into a dataset requires us to:

  • Know what we give up when we do this
  • Be deliberate and thoughtful as we proceed
  • Consider who profits from the data collected
  • Think about whose power it reflects

Key Insight

Data are not neutral. Understanding, capturing, classifying, and naming data is an exercise in building a world and reflects power.

What is Measurement?

The International Organisation of Legal Metrology defines measurement as:

“The process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity”

This definition highlights several concerns:

  • Instrumentation – what we use to conduct the measurement
  • Units – the reference for comparison
  • Validity – is the measurement appropriate?
  • Reliability – is the measurement consistent?

Instrumentation

What we use to conduct the measurement determines what we can measure.

Historical Examples

  • Microscopes (16th century) led to observation of cells, bacteria
  • Timekeeping evolution: sundials → atomic clocks
  • Modern sports differentiate competitors to thousandths of a second

Modern Instruments

  • Surveys
  • Sensors (temperature, accelerometers)
  • Satellite imagery
  • Cookies and behavioural tracking
  • A/B testing frameworks

Properties of Measurements: Validity

Validity

Valid measurements are those where the quantity we are measuring is related to the estimand and research question of interest. It speaks to appropriateness.

Examples of validity challenges:

  • What makes a “one-hit wonder” in music?
  • How do we measure “intelligence” or university quality?
  • Why does WHO define maternal deaths as within 42 days of delivery?

For constructed measures, Flake and Fried (2020) recommend questioning:

  • The underlying construct of interest
  • The decision process that led to the measure
  • What alternatives were considered

Properties of Measurements: Reliability

Reliability

Reliability implies some degree of consistency – multiple measurements of one particular aspect, at one particular time, should be essentially the same.

Examples:

  • If two enumerators count shops on a street, their counts should match
  • If they differ, we should understand why (e.g., different instructions)
  • Migration data: in-migration to Country A from Country B should match out-migration from B to A
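As a toy illustration of the shop-count example above (the counts are invented), agreement between two enumerators can be quantified directly:

```r
# Hypothetical shop counts from two enumerators on the same four streets
enum_a <- c(12, 8, 15, 20)
enum_b <- c(12, 9, 15, 20)

# Share of streets where the two counts agree exactly
agreement <- mean(enum_a == enum_b)
agreement
```

Where counts disagree (here, the second street), the next step is to find out why, for example whether the enumerators were given different instructions.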

Measurement Error

Measurement error is the difference between the value we observe and the actual value.

Types of measurement error:

Censored Data

We have partial knowledge of the actual value

  • Right-censored: Know value is above some observed value
  • Left-censored: Know value is below some observed value

Other Types

  • Winsorised: We observe the actual value but change it to a less extreme one
  • Truncated: We do not even record extreme values
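A minimal sketch of the difference between winsorising and truncating, using invented values and the 5th/95th percentiles as bounds:

```r
# Sketch: winsorising vs truncating a small vector (illustrative values)
x <- c(1, 2, 3, 4, 5, 100)  # 100 is an extreme value

bounds <- quantile(x, probs = c(0.05, 0.95))

# Winsorised: extreme values are clamped to the bounds, not removed
x_winsorised <- pmin(pmax(x, bounds[1]), bounds[2])

# Truncated: extreme values are dropped entirely
x_truncated <- x[x >= bounds[1] & x <= bounds[2]]
```

Note the winsorised vector keeps its original length, while the truncated vector is shorter.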

Visualising Censored and Truncated Data

library(tidyverse)

set.seed(853)
newborn_weight <- tibble(
  weight = rep(rnorm(n = 1000, mean = 3.5, sd = 0.5), times = 3),
  measurement = rep(c("Actual", "Censored", "Truncated"), each = 1000)
)

newborn_weight <- newborn_weight |>
  mutate(weight = case_when(
    weight <= 2.75 & measurement == "Censored" ~ 2.75,
    weight >= 4.25 & measurement == "Truncated" ~ NA_real_,
    TRUE ~ weight
  ))

newborn_weight |>
  ggplot(aes(x = weight)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  facet_wrap(vars(measurement)) +
  theme_minimal() +
  labs(x = "Newborn Weight (kg)", y = "Count")

Comparing Means: The Impact of Measurement Error

library(knitr)

newborn_weight |>
  summarise(mean = mean(weight, na.rm = TRUE), .by = measurement) |>
  kable(col.names = c("Measurement Type", "Mean Weight (kg)"), digits = 3)

Measurement Type    Mean Weight (kg)
Actual                         3.521
Censored                       3.530
Truncated                      3.455

Key Takeaway

Different types of measurement error introduce different biases:

  • Censored data can inflate means (pile-up at threshold)
  • Truncated data can deflate means (extreme values removed)

Missing Data

Regardless of how good our data acquisition process is, there will be missing data.

Important Distinction

A variable must be measured, or at least thought about, in order to be missing. With insufficient consideration, there is the danger of missing data that we do not even know are missing because the variables were never considered.

Types of missing data mechanisms:

  • MCAR – Missing Completely At Random
  • MAR – Missing At Random
  • MNAR – Missing Not At Random

Missing Data Mechanisms

MCAR

Missing Completely At Random

  • Rarely occurs
  • If it does, inference should still reflect the broader population

MAR

Missing At Random

  • Missingness depends on observed variables
  • Can often be addressed with appropriate modelling

MNAR

Missing Not At Random

  • Missingness depends on the unobserved value itself
  • Most problematic for inference
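The practical consequence of these mechanisms can be shown with a small simulation on invented income data: under MCAR the observed mean stays close to the truth, while under MNAR (here, higher incomes more likely to be missing) it is biased.

```r
# Sketch: simulated incomes showing how MCAR vs MNAR missingness biases the mean
set.seed(853)
income <- rnorm(10000, mean = 60, sd = 15)  # hypothetical incomes ($000s)

# MCAR: every value has the same 30% chance of being missing
mcar <- ifelse(runif(10000) < 0.3, NA, income)

# MNAR: higher incomes are more likely to be missing
p_missing <- 0.6 * pnorm(income, mean = 60, sd = 15)
mnar <- ifelse(runif(10000) < p_missing, NA, income)

c(actual = mean(income),
  mcar = mean(mcar, na.rm = TRUE),   # close to the actual mean
  mnar = mean(mnar, na.rm = TRUE))   # biased downward
```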

Non-response Matters

Gelman et al. (2016) argue that much of the changes in public opinion reported before elections are not people changing their mind, but differential non-response.

Censuses and Government Data

“Farmed Data”

We describe datasets that have been developed specifically for the purpose of being used as data as “farmed data”.

Characteristics of farmed datasets:

  • Typically well put together
  • Thoroughly documented
  • Work of collecting, preparing, and cleaning is mostly done for us
  • Conducted on a known release cycle

Examples:

  • Unemployment and inflation (monthly)
  • GDP (quarterly)
  • Census (every 5-10 years)

Census Data

Censuses of population have the power and financial resources of the state behind them.

  • The 2020 United States Census cost approximately US$15.6 billion
  • First modern census (naming every individual): Canada, 1666
  • Census data are not unimpeachable

Common Census Errors

  • Under-enumeration (not counting everyone)
  • Over-enumeration (counting people twice)
  • Misreporting

Accessing Australian Census Data

Australia conducts a census every five years through the Australian Bureau of Statistics (ABS).

Accessing ABS data in R:

# Install the readabs package
install.packages("readabs")
library(readabs)

# Read ABS time series data
unemployment <- read_abs(series_id = "A84423050A")

# Search the available ABS catalogues
search_catalogues("population")

TableBuilder

For detailed census data, the ABS provides TableBuilder – a free online tool for creating custom tables from census data.

IPUMS: International Census Data

IPUMS (Integrated Public Use Microdata Series) provides access to a wide range of datasets, including international census microdata.

Workflow for accessing IPUMS data:

  1. Create an account at ipums.org
  2. Select your sample (e.g., “2019 ACS”)
  3. Choose variables of interest
  4. Download in .dta format
  5. Read into R with haven::read_dta()

Citation Required

It is critical that we cite IPUMS datasets when we use them. See the download page for the appropriate citation.

Example: Cleaning IPUMS Data

library(haven)
library(labelled)
library(tidyverse)

ipums_extract <- read_dta("usa_00015.dta")

cleaned_ipums <- ipums_extract |>
  select(stateicp, sex, age, educd) |>
  to_factor() |>
  mutate(age = as.numeric(as.character(age))) |>  # via character: avoids factor level codes
  filter(age >= 18) |>
  rename(gender = sex) |>
  mutate(
    age_group = case_when(
      age <= 29 ~ "18-29",
      age <= 44 ~ "30-44",
      age <= 59 ~ "45-59",
      age >= 60 ~ "60+",
      TRUE ~ "Trouble"
    )
  )

Sampling Essentials

The Challenge of Sampling

“Statistics is the science of how to collect and analyse data and draw statements and conclusions about unknown populations.” – Wu and Thompson (2020)

We can never have all the data we would like.

Key Terminology

  • Target population: The collection of all items about which we would like to speak
  • Sampling frame: A list of all the items from the target population that we could get data about
  • Sample: The items from the sampling frame that we get data about

Target Population, Sampling Frame, and Sample

Example 1: Books ever written

  • Target population: All books ever written
  • Sampling frame: All books in the Library of Congress, or Google Books (~25 million books)
  • Sample: Books available through Project Gutenberg

Example 2: Brazilians in Germany

  • Target population: All Brazilians who live in Germany
  • Sampling frame: All Brazilians in Germany who have Facebook
  • Sample: Brazilians in Germany with Facebook whom we can gather data about

Probability vs Non-Probability Sampling

Probability Sampling

Every unit in the sampling frame has some known chance of being sampled, and the specific sample is obtained randomly.

  • Enables inference to the population
  • More expensive and difficult
  • Foundation of classical statistics

Non-Probability Sampling

Units are sampled based on convenience, quotas, judgement, or other non-random processes.

  • Cheaper and quicker
  • Cannot make formal probability statements
  • May be biased in various ways

Key Insight

The difference between probability and non-probability sampling is often one of degree rather than dichotomy.

Types of Probability Sampling

Simple Random Sampling

Every unit has the same chance of being included.

set.seed(853)
# 20% chance of being selected
sample(x = c("In", "Out"), size = 10, replace = TRUE, prob = c(0.2, 0.8))
 [1] "Out" "Out" "Out" "Out" "Out" "Out" "Out" "Out" "Out" "Out"
  • The purest form of random sampling
  • Easy to understand and implement
  • May be inefficient for rare populations

Systematic Sampling

Select every k-th unit after a random starting point.

set.seed(853)
# Random start between 1-5, then every 5th unit
starting_point <- sample(x = 1:5, size = 1)
selected_units <- seq.int(from = starting_point, to = 100, by = 5)
head(selected_units, 10)
 [1]  1  6 11 16 21 26 31 36 41 46

Example: Bowley (1913) used systematic sampling in Reading, England:

“One building in ten was marked throughout the local directory in alphabetical order of streets…”

Stratified Sampling

Divide population into mutually exclusive strata, then sample within each stratum.

set.seed(853)
data_with_strata <- tibble(
  unit = 1:100,
  strata = (unit - 1) %/% 10  # Groups of 10
)

# Sample 2 from each stratum
sampled <- data_with_strata |>
  slice_sample(n = 2, by = strata)

table(sampled$strata)

0 1 2 3 4 5 6 7 8 9 
2 2 2 2 2 2 2 2 2 2 

Use case: Ensuring adequate representation of small states/regions

Cluster Sampling

Select clusters of units, then sample all or some units within selected clusters.

set.seed(853)
# Select 2 clusters (out of 10)
picked_clusters <- sample(x = 0:9, size = 2)
picked_clusters
[1] 8 0
# All units in those clusters are included
selected <- tibble(unit = 1:100) |>
  mutate(cluster = (unit - 1) %/% 10) |>
  filter(cluster %in% picked_clusters)

nrow(selected)
[1] 20

Advantage: Can be cheaper due to geographic concentration

Non-Probability Sampling Methods

Method             Description                                      Trade-offs
Convenience        Sample from easily accessible units              Quick, cheap; potentially biased
Quota              Fill predefined quotas for subgroups             Ensures representation; not random within groups
Snowball           Ask respondents to recruit others                Reaches hidden populations; network bias
Respondent-driven  Like snowball, with compensation for recruiting  Better for hidden populations; complex weighting
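To make the quota method concrete, here is a minimal sketch on an invented sampling frame (it assumes dplyr ≥ 1.1 for the `by` argument):

```r
# Sketch: quota sampling -- fill a fixed quota per subgroup from whoever is reachable
library(tidyverse)

set.seed(853)
frame <- tibble(
  id = 1:1000,
  gender = sample(c("Man", "Woman"), size = 1000, replace = TRUE)
)

# Take the first 50 units encountered in each group (convenient, not random)
quota_sample <- frame |>
  slice_head(n = 50, by = gender)

count(quota_sample, gender)
```

The quotas guarantee subgroup representation, but within each group the selection is still non-random.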

APIs

What is an API?

Application Programming Interface (API)

A website that is set up for another computer to be able to access it, rather than a person.

How it works:

  1. We provide a URL (with parameters)
  2. The server processes our request
  3. We receive data back (usually in JSON or XML format)

Example: Google Maps

  • Human navigation: scroll, click, drag
  • API access: https://www.google.com/maps/@-35.28,149.12,16z

Why Use APIs?

Advantages of APIs:

  • Data provider specifies what data they will provide
  • Terms of use are usually clear (rate limits, commercial use)
  • Less likely to be subject to unexpected changes
  • Legal and ethical considerations are clearer

Best Practice

When an API is available, we should try to use it rather than web scraping.

Using APIs with httr

library(httr)

# Make a GET request to the arXiv API
arxiv <- GET("http://export.arxiv.org/api/query?id_list=2111.09299")

# Check the status code
status_code(arxiv)  # 200 = success; 4xx/5xx = errors

# View the content
content(arxiv)

Common status codes:

  • 200: Success
  • 400: Bad request
  • 401: Unauthorised
  • 404: Not found
  • 429: Too many requests (rate limited)

API Response Formats: XML

XML (Extensible Markup Language) uses nested tags:

<entry>
  <title>Some Research Paper Title</title>
  <author>Jane Smith</author>
  <published>2023-01-15</published>
</entry>

Parsing XML in R:

library(xml2)

# Read the response body as text, then parse it as XML
arxiv_xml <- content(arxiv, as = "text", encoding = "UTF-8") |>
  read_xml()

# Explore the XML structure
xml_structure(arxiv_xml)

# Extract specific elements
arxiv_xml |>
  xml_child(search = 8) |>   # Navigate to entry
  xml_child(search = 4) |>   # Navigate to title
  xml_text()                  # Extract text

API Response Formats: JSON

JSON (JavaScript Object Notation) uses key-value pairs:

{
  "firstName": "Rohan",
  "lastName": "Alexander",
  "age": 36,
  "favFoods": {
    "first": "Pizza",
    "second": "Bagels"
  }
}

Parsing JSON in R:

library(jsonlite)

# Parse JSON from a Dataverse API
politics_datasets <- fromJSON(
  "https://demo.dataverse.org/api/search?q=politics"
)

# Access nested data
as_tibble(politics_datasets[["data"]][["items"]])
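As a self-contained illustration (no network call needed), the JSON object shown earlier can be parsed directly from a string:

```r
library(jsonlite)

person <- fromJSON('{
  "firstName": "Rohan",
  "lastName": "Alexander",
  "age": 36,
  "favFoods": {"first": "Pizza", "second": "Bagels"}
}')

person$age              # 36
person$favFoods$first   # "Pizza"
```

Nested objects become nested lists in R, accessed with the usual `$` syntax.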

API Keys and Authentication

Many APIs require authentication via API keys.

Keep Your Keys Secret!

API keys should be kept private. Never commit them to GitHub.

Best practice: Use .Renviron

library(usethis)

# Open .Renviron file
edit_r_environ()

# Add your keys (use single quotes)
# SPOTIFY_CLIENT_ID = 'your_client_id_here'
# SPOTIFY_CLIENT_SECRET = 'your_secret_here'

# Save and restart R
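Once a key is stored in .Renviron, it can be retrieved at runtime with `Sys.getenv()`, so it never appears in your scripts:

```r
# Sketch: retrieving a stored key at runtime (never hard-code keys in scripts)
client_id <- Sys.getenv("SPOTIFY_CLIENT_ID")  # returns "" if not set

# nzchar() is TRUE only for a non-empty string, so this flags a missing key
key_is_set <- nzchar(client_id)
```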

Example: Spotify API with spotifyr

library(spotifyr)

# Get artist audio features (keys are read from .Renviron)
radiohead <- get_artist_audio_features("radiohead")

# Explore the data
radiohead |>
  select(track_name, album_name, duration_ms, valence) |>
  head()

Valence: Spotify’s measure of “musical positiveness” (0 to 1)

  • Higher values = more positive sounding
  • What does this actually measure?

Working with Messy Data

The Reality of Real Data

Real-world data is rarely clean and ready to use.

Common issues:

  • Missing values coded in unexpected ways (98, 99, -999, etc.)
  • Multiple variables combined into one column
  • Inconsistent units or scales
  • Human rounding and reporting errors
  • Measurement instrument limitations

Step 1: Look at the Data

Always start by examining your data!

# Examine a variable (sex is coded 1 = men, 2 = women in this survey)
table(sex)
#   1    2
# 749 1282

# Create a more descriptive variable
male <- 2 - sex  # 1 for men, 0 for women
# Cross-tabulation reveals patterns
table(height_feet, height_inches)
#              height_inches
# height_feet   0  1  2  3  4  5  6  7  8  9 10 11 98 99
#           4   0  0  0  0  0  0  0  0  0  1  3 17  0  0
#           5  66 56 144...
#           9   0  0  0  0  0  0  0  0  0  0  0  0  0 26

Step 2: Identify Problems

# Simulated weight data showing common patterns
set.seed(853)
sample_weights <- c(
  100, 105, 110, 115, 120, 125, 130,  # Round numbers
  112, 118, 123, 127,                  # In-between
  200, 210,                            # "Nice" numbers
  998, 999                             # Missing codes
)

table(sample_weights)
sample_weights
100 105 110 112 115 118 120 123 125 127 130 200 210 998 999 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 

Patterns to notice:

  • People round to nearest 5 or 10
  • “Nice” numbers are overrepresented (e.g., 200 lbs, exactly 6 feet)
  • Missing data codes at extreme values

Step 3: Recode Missing Values

# Example: Recode impossible values as NA
height_inches <- c(0, 5, 11, 98, 99, 6, 4)
height_feet <- c(5, 5, 5, 9, 9, 6, 7)

# Recode missing
height_inches[height_inches > 11] <- NA
height_feet[height_feet >= 7] <- NA

# Combine into single variable
height <- 12 * height_feet + height_inches
height
[1] 60 65 71 NA NA 78 NA

Step 4: Handle Complex Variables

Example: Earnings with multiple response formats

# Some people gave exact amounts (variables here follow the ROS earnings example)
table(is.na(earn_exact))
# FALSE  TRUE
#  1380   651

# Non-responders were asked categorical question
# Code 90 = nonresponse, Code 1 = "more than $100,000"

# Handle special codes
earn_approx[earn2 >= 90] <- NA
earn_approx[earn2 == 1] <- median(
  earn_exact[earn_exact > 100000], 
  na.rm = TRUE
) / 1000

# Combine
earn <- ifelse(is.na(earn_exact), 1000 * earn_approx, earn_exact)

Key R Functions for Messy Data

Function           Purpose
table()            Frequency counts, cross-tabulation
is.na()            Check for missing values
complete.cases()   Identify rows with no missing values
ifelse()           Conditional recoding
case_when()        Multiple-condition recoding
factor()           Create categorical variables

# Example: Check for complete cases
data <- tibble(
  x = c(1, 2, NA, 4),
  y = c("a", NA, "c", "d")
)
complete.cases(data)
[1]  TRUE FALSE FALSE  TRUE

The Data Cleaning Workflow

Three Steps for Each Variable

  1. Look at the data – use table(), summary(), str()
  2. Identify errors or missing data – note patterns, codes
  3. Transform or combine – create analysis-ready variables

Remember:

  • The new variable may still have missing values
  • Document your cleaning decisions
  • Save both raw and cleaned data
  • Make your cleaning reproducible (script, not manual)
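The three steps above can be sketched as a short script on invented data (the missing code 999 and the file paths are hypothetical):

```r
# Sketch of the three-step workflow (hypothetical missing code: 999)
library(tidyverse)

raw <- tibble(
  age = c(25, 34, 999, 41),
  income = c(50, NA, 72, 999)
)

# 1. Look at the data; 2. identify 999 as a missing code; 3. transform
cleaned <- raw |>
  mutate(
    age = if_else(age == 999, NA_real_, age),
    income = if_else(income == 999, NA_real_, income)
  )

# Keep raw and cleaned data separate, and do all of this in a script
# write_csv(raw, "data/raw.csv")
# write_csv(cleaned, "data/cleaned.csv")
```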

Summary

Key Takeaways

Measurement

  • Validity = appropriateness
  • Reliability = consistency
  • Measurement error introduces bias
  • Missing data mechanisms matter

Sampling

  • Target population ≠ Sampling frame ≠ Sample
  • Probability sampling enables inference
  • Non-probability sampling has trade-offs
  • Always consider who is included/excluded

Data Acquisition

  • APIs provide structured data access
  • Use httr for direct API calls
  • Store API keys securely
  • JSON and XML are common formats

Messy Data

  • Always examine data before analysis
  • Identify and recode missing values
  • Document cleaning decisions
  • Make cleaning reproducible

Readings for This Week

Telling Stories with Data (TSwD):

  • Chapter 6: Farm data (6.2-6.4)
  • Chapter 7: Gather data (7.2 APIs)

Regression and Other Stories (ROS):

  • Appendix A.6: Working with messy data

Next Week

Week 4: Data Visualisation

  • Creating effective visualisations with ggplot2
  • Principles of good graphics
  • Choosing appropriate plot types
  • Comparing distributions

Preparation

Install the ggplot2 package if you haven’t already:

install.packages("ggplot2")
